class: center, middle, inverse, title-slide # Databases ### Kirill & Nicolas ### cynkra GmbH ### March 15, 2022 --- <style type="text/css"> .pull-left { margin-top: -25px; } .pull-right { margin-top: -25px; } .remark-code { font-size: 14px; } .font17 { font-size: 17px; } .font14 { font-size: 14px; } </style> --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - **Joins** - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/21.webp") background-size: 40% background-position: 100% 100% # Joins .pull-left[ - Usage - Join sources - Mounting Script: `databases_21.R` ```r library(tidyverse) ``` ] --- background-image: url("data:image/png;base64,#images/21-frame.webp") background-size: 40% background-position: 100% 100% # Joins .pull-left[ - Usage - Join sources - Mounting Script: `databases_21.R` ```r library(tidyverse) ``` ] --- # A second table .pull-left[ ```r academy <- tbl(con_duckdb, "academy") academy_sqlite <- tbl(con_sqlite, "academy") academy %>% count(status) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> Award not yet introduced 3 <span style='color: #BCBCBC;'>2</span> Nominated 36 <span style='color: #BCBCBC;'>3</span> Ineligible 23 <span style='color: #BCBCBC;'>4</span> Won Special Achievement 1 <span style='color: #BCBCBC;'>5</span> Won 17 </CODE></PRE> ] .pull-right[ ```r academy ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: table<academy> [?? x 3]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Feature Award not yet introduced <span style='color: #BCBCBC;'> 2</span> Toy Story Original Screenplay Nominated <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Screenplay Ineligible <span style='color: #BCBCBC;'> 4</span> Toy Story Original Score Nominated <span style='color: #BCBCBC;'> 5</span> Toy Story Original Song Nominated <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won Special Achievement <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Feature Award not yet introduced <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Screenplay Ineligible <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Score Nominated <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Feature Award not yet introduced <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Left join The most frequent kind of join. .pull-left[ ## Unsafe ```r academy %>% left_join(pixar_films) ``` <PRE class="fansi fansi-message"><CODE>Joining, by = "film" </CODE></PRE><PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 7]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>release_d…</span> <span style='font-weight: bold;'>run_t…</span> <span style='font-weight: bold;'>film_…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Fea… Award… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> Toy Story Original Scr… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Scre… Ineli… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 4</span> Toy Story Original Sco… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 5</span> Toy Story Original Song Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won S… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Fea… Award… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Scre… Ineli… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Sco… Nomin… 2 1998-11-25 95 G <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Fea… Award… 3 1999-11-24 92 G <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ## Better ```r academy %>% left_join(pixar_films, by = "film") ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 7]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>release_d…</span> <span style='font-weight: bold;'>run_t…</span> <span style='font-weight: bold;'>film_…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Fea… Award… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> Toy Story Original Scr… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Scre… Ineli… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 4</span> Toy Story Original Sco… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 5</span> Toy Story Original Song Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won S… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Fea… Award… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Scre… Ineli… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Sco… Nomin… 2 1998-11-25 95 G <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Fea… Award… 3 1999-11-24 92 G <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Left join Computed on the database, original data unchanged. ```r academy %>% left_join(pixar_films, by = "film") %>% show_query() ``` ``` <SQL> SELECT "LHS"."film" AS "film", "award_type", "status", "number", "release_date", "run_time", "film_rating" FROM "academy" AS "LHS" LEFT JOIN "pixar_films" AS "RHS" ON ("LHS"."film" = "RHS"."film") ``` --- # Join with preparation The right-hand side in the join should come from a variable. .pull-left[ ## Prepare RHS ```r academy_won <- academy %>% filter(status == "Won") %>% count(film, name = "n_won") academy_won ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 2]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>n_won</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> Monsters, Inc. 1 <span style='color: #BCBCBC;'> 2</span> Finding Nemo 1 <span style='color: #BCBCBC;'> 3</span> The Incredibles 2 <span style='color: #BCBCBC;'> 4</span> Ratatouille 1 <span style='color: #BCBCBC;'> 5</span> WALL-E 1 <span style='color: #BCBCBC;'> 6</span> Up 2 <span style='color: #BCBCBC;'> 7</span> Toy Story 3 2 <span style='color: #BCBCBC;'> 8</span> Brave 1 <span style='color: #BCBCBC;'> 9</span> Inside Out 1 <span style='color: #BCBCBC;'>10</span> Coco 2 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ```r pixar_films %>% left_join(academy_won, by = "film") ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 6]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n_won</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> 4 Monsters, Inc. 2001-11-02 92 G 1 <span style='color: #BCBCBC;'> 2</span> 5 Finding Nemo 2003-05-30 100 G 1 <span style='color: #BCBCBC;'> 3</span> 6 The Incredibles 2004-11-05 115 PG 2 <span style='color: #BCBCBC;'> 4</span> 8 Ratatouille 2007-06-29 111 G 1 <span style='color: #BCBCBC;'> 5</span> 9 WALL-E 2008-06-27 98 G 1 <span style='color: #BCBCBC;'> 6</span> 10 Up 2009-05-29 96 PG 2 <span style='color: #BCBCBC;'> 7</span> 11 Toy Story 3 2010-06-18 103 G 2 <span style='color: #BCBCBC;'> 8</span> 13 Brave 2012-06-22 93 PG 1 <span style='color: #BCBCBC;'> 9</span> 15 Inside Out 2015-06-19 95 PG 1 <span style='color: #BCBCBC;'>10</span> 19 Coco 2017-11-22 105 PG 2 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Join with postprocessing .pull-left[ ## Raw result ```r pixar_films %>% left_join(academy_won, by = "film") %>% arrange(release_date) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 6]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='color: #949494;'># Ordered by: release_date</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n_won</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G <span style='color: #BB0000;'>NA</span> <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G <span style='color: #BB0000;'>NA</span> <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G <span style='color: #BB0000;'>NA</span> <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G 1 <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G 1 <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG 2 <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G <span style='color: #BB0000;'>NA</span> <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G 1 <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G 1 <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG 2 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] .pull-right[ ## After postprocessing ```r pixar_films %>% left_join(academy_won, by = "film") %>% mutate(n_won = coalesce(n_won, 0L)) %>% arrange(release_date) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 6]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='color: #949494;'># Ordered by: release_date</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>release_date</span> <span style='font-weight: bold;'>run_time</span> <span style='font-weight: bold;'>film_rating</span> <span style='font-weight: bold;'>n_won</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'> 1</span> 1 Toy Story 1995-11-22 81 G 0 <span style='color: #BCBCBC;'> 2</span> 2 A Bug's Life 1998-11-25 95 G 0 <span style='color: #BCBCBC;'> 3</span> 3 Toy Story 2 1999-11-24 92 G 0 <span style='color: #BCBCBC;'> 4</span> 4 Monsters, Inc. 2001-11-02 92 G 1 <span style='color: #BCBCBC;'> 5</span> 5 Finding Nemo 2003-05-30 100 G 1 <span style='color: #BCBCBC;'> 6</span> 6 The Incredibles 2004-11-05 115 PG 2 <span style='color: #BCBCBC;'> 7</span> 7 Cars 2006-06-09 117 G 0 <span style='color: #BCBCBC;'> 8</span> 8 Ratatouille 2007-06-29 111 G 1 <span style='color: #BCBCBC;'> 9</span> 9 WALL-E 2008-06-27 98 G 1 <span style='color: #BCBCBC;'>10</span> 10 Up 2009-05-29 96 PG 2 <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Join with processing Computed on the database, original data unchanged. ```r pixar_films %>% left_join(academy_won, by = "film") %>% mutate(n_won = coalesce(n_won, 0L)) %>% arrange(release_date) %>% show_query() ``` ``` <SQL> SELECT "number", "film", "release_date", "run_time", "film_rating", COALESCE("n_won", 0) AS "n_won" FROM (SELECT "number", "LHS"."film" AS "film", "release_date", "run_time", "film_rating", "n_won" FROM "pixar_films" AS "LHS" LEFT JOIN (SELECT "film", COUNT(*) AS "n_won" FROM "academy" WHERE ("status" = 'Won') GROUP BY "film") "RHS" ON ("LHS"."film" = "RHS"."film") ) "q01" ORDER BY "release_date" ``` --- # Tables must be on the same source Use `copy = TRUE` to enforce, the result is a lazy table if the LHS is a lazy table. .pull-left[ ## Bad ```r try( academy %>% left_join(pixar_films_sqlite, by = "film") ) ``` <PRE class="fansi fansi-output"><CODE>Error in auto_copy(x, y, copy = copy, indexes = if (auto_index) list(by$y)) : `x` and `y` must share the same src. <span style='color: #00BBBB;'>ℹ</span> set `copy` = TRUE (may be slow). </CODE></PRE> ] .pull-right[ ## Not too bad ```r academy %>% left_join(pixar_films_sqlite, by = "film", copy = TRUE) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 7]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>release_d…</span> <span style='font-weight: bold;'>run_t…</span> <span style='font-weight: bold;'>film_…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Fea… Award… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> Toy Story Original Scr… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Scre… Ineli… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 4</span> Toy Story Original Sco… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 5</span> Toy Story Original Song Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won S… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Fea… Award… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Scre… Ineli… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Sco… Nomin… 2 1998-11-25 95 G <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Fea… Award… 3 1999-11-24 92 G <span style='color: #949494;'># … with more rows</span> </CODE></PRE> ] --- # Copying is expensive! A temporary table is created on the LHS database. If the RHS comes from a different database, results are temporarily loaded into the local session! ```r academy %>% left_join(pixar_films_sqlite, by = "film", copy = TRUE) %>% show_query() ``` ``` <SQL> SELECT "LHS"."film" AS "film", "award_type", "status", "number", "release_date", "run_time", "film_rating" FROM "academy" AS "LHS" LEFT JOIN "dbplyr_002" AS "RHS" ON ("LHS"."film" = "RHS"."film") ``` --- # Joining data frames with lazy tables The result is a data frame too. .pull-left[ ## Bad ```r try( pixarfilms::academy %>% left_join(pixar_films, by = "film") ) ``` <PRE class="fansi fansi-output"><CODE>Error in auto_copy(x, y, copy = copy) : `x` and `y` must share the same src. <span style='color: #00BBBB;'>ℹ</span> set `copy` = TRUE (may be slow). </CODE></PRE> ] .pull-right[ ## Could be worse ```r pixarfilms::academy %>% left_join(pixar_films, by = "film", copy = TRUE) ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># A tibble: 80 × 7</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>release_d…</span> <span style='font-weight: bold;'>run_t…</span> <span style='font-weight: bold;'>film_…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'> 1</span> Toy Story Animated Fea… Award… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 2</span> Toy Story Original Scr… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 3</span> Toy Story Adapted Scre… Ineli… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 4</span> Toy Story Original Sco… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 5</span> Toy Story Original Song Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 6</span> Toy Story Other Won S… 1 1995-11-22 81 G <span style='color: #BCBCBC;'> 7</span> A Bug's Life Animated Fea… Award… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 8</span> A Bug's Life Adapted Scre… Ineli… 2 1998-11-25 95 G <span style='color: #BCBCBC;'> 9</span> A Bug's Life Original Sco… Nomin… 2 1998-11-25 95 G <span style='color: #BCBCBC;'>10</span> Toy Story 2 Animated Fea… Award… 3 1999-11-24 92 G <span style='color: #949494;'># … with 70 more rows</span> </CODE></PRE> ] --- # DuckDB: register data frames as database tables Temporarily use a local data frame as a table. Also works for Arrow datasets via `duckdb::duckdb_register_arrow()`. .pull-left[ ## Register and access ```r duckdb::duckdb_register( con_duckdb, "academy_small", pixarfilms::academy[1:3, ] ) academy_small <- tbl(con_duckdb, "academy_small") academy_small ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: table<academy_small> [?? x 3]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> Toy Story Animated Feature Award not yet introduced <span style='color: #BCBCBC;'>2</span> Toy Story Original Screenplay Nominated <span style='color: #BCBCBC;'>3</span> Toy Story Adapted Screenplay Ineligible </CODE></PRE> ] .pull-right[ ## Use ```r academy_small %>% left_join(pixar_films, by = "film") ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 7]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>film</span> <span style='font-weight: bold;'>award_type</span> <span style='font-weight: bold;'>status</span> <span style='font-weight: bold;'>number</span> <span style='font-weight: bold;'>release_d…</span> <span style='font-weight: bold;'>run_t…</span> <span style='font-weight: bold;'>film_…</span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #949494; font-style: italic;'><date></span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #949494; font-style: italic;'><chr></span> <span style='color: #BCBCBC;'>1</span> Toy Story Adapted Screenpl… Ineli… 1 1995-11-22 81 G <span style='color: #BCBCBC;'>2</span> Toy Story Original Screenp… Nomin… 1 1995-11-22 81 G <span style='color: #BCBCBC;'>3</span> Toy Story Animated Feature Award… 1 1995-11-22 81 G </CODE></PRE> ] --- # DuckDB: Performance comparison Baseline: Data frames. .pull-left[ ```r nrow(nycflights13::flights) ``` ``` [1] 336776 ``` ] .pull-right[ ```r system.time( nycflights13::flights %>% count(year, month, day) ) ``` ``` user system elapsed 0.01 0.00 0.01 ``` ] --- # DuckDB: Performance comparison With registration. .pull-left[ ```r system.time(duckdb::duckdb_register( con_duckdb, "flights", nycflights13::flights )) ``` ``` user system elapsed 0.003 0.001 0.003 ``` ```r flights_register <- tbl(con_duckdb, "flights") flights_register %>% count() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 1]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> <span style='text-decoration: underline;'>336</span>776 </CODE></PRE> ] .pull-right[ ```r system.time( flights_register %>% count(year, month, day) %>% collect() ) ``` ``` user system elapsed 0.031 0.000 0.030 ``` ] --- # DuckDB: Performance comparison With copy. .pull-left[ ```r system.time( flights_copy <- copy_to(con_duckdb, nycflights13::flights) ) ``` ``` user system elapsed 0.087 0.008 0.095 ``` ```r flights_copy %>% count() ``` <PRE class="fansi fansi-output"><CODE><span style='color: #949494;'># Source: lazy query [?? x 1]</span> <span style='color: #949494;'># Database: duckdb_connection</span> <span style='font-weight: bold;'>n</span> <span style='color: #949494; font-style: italic;'><dbl></span> <span style='color: #BCBCBC;'>1</span> <span style='text-decoration: underline;'>336</span>776 </CODE></PRE> ] .pull-right[ ```r system.time( flights_copy %>% count(year, month, day) %>% collect() ) ``` ``` user system elapsed 0.008 0.000 0.006 ``` ] --- # ETL, revisited Insert a second table into our database. ```r db_path <- fs::path_abs("pixar.duckdb") con <- DBI::dbConnect(duckdb::duckdb(dbdir = db_path)) DBI::dbWriteTable(con, "academy", pixarfilms::academy, overwrite = TRUE) DBI::dbExecute(con, "CREATE UNIQUE INDEX academy_pk ON academy (film, award_type)") ``` ``` [1] 0 ``` ```r DBI::dbExecute(con, "CREATE INDEX academy_fk ON academy (film)") ``` ``` [1] 0 ``` ```r DBI::dbDisconnect(con) ``` --- background-image: url("data:image/png;base64,#images/21-frame.webp") background-size: 40% background-position: 100% 100% # Joins: Exercises 1 .pull-left[ 1. How many rows does the join between `academy` and `pixar_films` contain? Try to find out without loading all the data into memory. Explain. 2. Which films are not yet listed in the `academy` table? What does the resulting SQL query look like? - Hint: Use `anti_join()` ] --- background-image: url("data:image/png;base64,#images/21-frame.webp") background-size: 40% background-position: 100% 100% # Joins: Exercises 2 .pull-left[ 3. Transform `academy` into a wide table so that there is at most one row per film. Join the resulting table with the `pixar_films` table. - Hint: Use `pivot_wider()`, `spread()`, `dcast()`, ... . You need to compute locally, because these functions don't work on the database. ] .pull-right[ 4. Plot a bar chart with the number of awards won and nominated per year. Compute as much as possible on the database. - Hint: "Long form" or "wide form"? ] --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - **The {dm} package** - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/22.webp") background-size: 40% background-position: 100% 100% # Data models .pull-left[ - Compound object for multiple tables - Features Script: `databases_22.R` ```r library(tidyverse) ``` ] --- background-image: url("data:image/png;base64,#images/22-frame.webp") background-size: 40% background-position: 100% 100% # Data models .pull-left[ - Compound object for multiple tables - Features Script: `databases_22.R` ```r library(tidyverse) ``` ] --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - **A bit of theory** - Playing the whole game ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/23.webp") background-size: 40% background-position: 100% 100% # Data models .pull-left[ - Keys, relationships, constraints - Zooming Script: `databases_23.R` ```r library(tidyverse) ``` ] --- background-image: url("data:image/png;base64,#images/23-frame.webp") background-size: 40% background-position: 100% 100% # Data models .pull-left[ - Keys, relationships, constraints - Zooming Script: `databases_23.R` ```r library(tidyverse) ``` ] --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - **Playing the whole game** ] ## Bonus: Bring your own data + questions --- background-image: url("data:image/png;base64,#images/24.webp") background-size: 40% background-position: 100% 100% # The whole game .pull-left[ - Build a local data model - Copy it to the database - Consume it Script: `databases_24.R` ```r library(tidyverse) ``` ] --- background-image: url("data:image/png;base64,#images/24-frame.webp") background-size: 40% background-position: 100% 100% # The whole game .pull-left[ - Build a local data model - Copy it to the database - Consume it Script: `databases_24.R` ```r library(tidyverse) ``` ] --- # Recap <table> <tr> <td rowspan=2> <img src="data:image/png;base64,#images/11-frame.webp" width="200px" /> </td> <td><img src="data:image/png;base64,#images/12-frame.webp" width="200px" /></td> <td><img src="data:image/png;base64,#images/12_2-frame.webp" width="200px" /></td> <td><img src="data:image/png;base64,#images/13-frame.webp" width="200px" /></td> <td><img src="data:image/png;base64,#images/14-frame.webp" width="200px" /></td> </tr> <tr style="background:transparent"> <td><img src="data:image/png;base64,#images/21-frame.webp" width="200px" /></td> <td><img src="data:image/png;base64,#images/22-frame.webp" width="200px" /></td> <td><img src="data:image/png;base64,#images/23-frame.webp" width="200px" /></td> <td><img src="data:image/png;base64,#images/24-frame.webp" width="200px" /></td> </tr> </table> --- # Schedule Extract ⇨ Transform ⇨ Load ⇨ Consume .pull-left[ ## One table - Read whole tables - Let the database do the heavy lifting - Basic ETL for one table - Subtle issues to watch out for ] .pull-right[ ## Multiple tables - Joins - The {dm} package - A bit of theory - Playing the whole game ] ## Bonus: Bring your own data + questions